Smoothing is a very powerful technique used all across data analysis. It is designed to estimate \(f(x)\) when the shape is unknown but assumed to be smooth. The general idea is to group data points into strata that are expected to have similar expectations and then compute the average, or fit a simple model, in each stratum. We will use the 2008 presidential election polls.
polls_2008
## Source: local data frame [543 x 14]
##
## Pollster start_date end_date N
## (chr) (time) (time) (int)
## 1 Marist College 2008-11-03 2008-11-03 804
## 2 GWU (Lake/Tarrance) 2008-11-02 2008-11-03 400
## 3 DailyKos.com (D)/Research 2000 2008-11-01 2008-11-03 1100
## 4 IBD/TIPP 2008-11-01 2008-11-03 981
## 5 Rasmussen 2008-11-01 2008-11-03 3000
## 6 ARG 2008-11-01 2008-11-03 1200
## 7 Reuters/ C-SPAN/ Zogby 2008-10-31 2008-11-03 1226
## 8 Harris Interactive 2008-10-30 2008-11-03 3946
## 9 Marist College 2008-11-02 2008-11-02 635
## 10 NBC/WSJ 2008-11-01 2008-11-02 NA
## .. ... ... ... ...
## Variables not shown: population_type (chr), McCain (dbl), Obama (dbl),
## Barr (chr), Nader (chr), Other (chr), Undecided (chr), Margin (chr),
## diff (dbl), day (dbl)
For each day starting June 1, 2008, we compute the average of the polls that started that day. We denote this average poll difference (the `diff` column) by \(Y\) and the day by \(X\). Below we create and plot this dataset and fit a regression line.
dat <- filter(polls_2008, start_date >= "2008-06-01") %>%
  group_by(X = day) %>%
  summarize(Y = mean(diff))
dat %>% ggplot(aes(X, Y)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Note that when we model \(f(x) = \mbox{E}(Y \mid X=x)\) with a line, we do not appear to describe the trend very well. For example, the Republican Convention was held on September 4 (day -62). This gave McCain a boost in the polls, which can be clearly seen in the data. The regression line does not capture this.
To see this more clearly, note that the points above the fitted line (green) and those below (purple) are not evenly distributed. We therefore need an alternative, more flexible approach.
resids <- ifelse(resid(lm(Y ~ X, data = dat)) > 0, "+", "-")
dat %>% mutate(resids = resids) %>%
  ggplot(aes(X, Y)) +
  geom_point(cex = 5, pch = 21) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(aes(X, Y, color = resids), cex = 4)
We will explore ways of estimating \(f(x)\) that do not assume it is linear.
Instead of fitting a line, let’s go back to the idea of stratifying and computing the mean. This is referred to as bin smoothing. The general idea is that the underlying curve does not vary wildly, which is what we mean by smooth. If the curve is smooth enough, then in small bins it is approximately constant. If we assume the curve is constant within a bin, then all the \(Y\) in that bin have the same expected value. For example, in the plot below, we use bins of one week and highlight the points in a bin centered at day -125 as well as the points in a bin centered at day -55. We also show the fitted mean values for the \(Y\) in those bins with dashed lines (code not shown):
By computing this mean for bins around every point, we form an estimate of the underlying curve \(f(x)\). Below we show the procedure happening as we move from the smallest value of \(X\) to the largest.
(Animation: binsmoother1.gif)
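The final result looks like this:
span <- 7  # bin width in days (assumed: one week, matching the bins described above)
# x.points makes ksmooth return one fitted value per observed day
mod <- ksmooth(dat$X, dat$Y, kernel = "box", bandwidth = span, x.points = dat$X)
bin_fit <- data.frame(X = dat$X, .fitted = mod$y)
ggplot(dat, aes(X, Y)) +
  geom_point(cex = 5) +
  geom_line(aes(x = X, y = .fitted), data = bin_fit, color = "red")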
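The box-kernel fit can also be computed by hand, which makes the bin smoother concrete. The sketch below is only an illustration under stated assumptions: it uses simulated data rather than the polls, and `x`, `y`, `bin_width` and `bin_smooth` are hypothetical names introduced here. The one-week bin width follows the text.

```r
# Simulated stand-in for the poll data (an assumption, not the real polls)
set.seed(1)
x <- as.numeric(-150:-1)                       # days relative to election day
y <- sin(x / 25) + rnorm(length(x), sd = 0.3)  # smooth trend plus noise

bin_width <- 7  # one week, as in the text

# Bin smoother: at each x0, average the y values whose x lies
# within half a bin width of x0
bin_smooth <- function(x, y, bin_width) {
  sapply(x, function(x0) mean(y[abs(x - x0) <= bin_width / 2]))
}

by_hand <- bin_smooth(x, y, bin_width)

# ksmooth with a box kernel, evaluated at the same points, for comparison
fit <- ksmooth(x, y, kernel = "box", bandwidth = bin_width, x.points = x)

# Largest disagreement between the two versions; this should be close to
# zero if the box kernel spans bandwidth / 2 on each side of x0
max(abs(by_hand - fit$y))
```

Writing the smoother as a plain average over a window makes it clear why the estimate jumps: a point is either fully inside the bin or fully outside it.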
Note that the final result is quite wiggly. One reason for this is that each time the window moves, two points change: one enters the bin and one leaves it. We can attenuate this somewhat by taking weighted averages that give the center point more weight and faraway points less weight.
In this animation we see that points on the edge get less weight:
(Animation: binsmoother2.gif)
Note that the estimate is smoother now.
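# Same call as for the box kernel, now with a normal kernel;
# x.points makes ksmooth return one fitted value per observed day
mod <- ksmooth(dat$X, dat$Y, kernel = "normal",
               bandwidth = span, x.points = dat$X)
bin_fit2 <- data.frame(X = dat$X, .fitted = mod$y)
ggplot(dat, aes(X, Y)) +
  geom_point(cex = 5) +
  geom_line(aes(x = X, y = .fitted), data = bin_fit2, color = "red")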
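The weighted average behind the normal kernel can be written out directly. This is a minimal sketch on simulated data, not the poll data; `kernel_smooth` and `sd_kernel` are hypothetical names. The conversion from bandwidth to standard deviation assumes the scaling described in the `ksmooth` help page, which states that kernels have their quartiles at ±0.25 times the bandwidth.

```r
# Simulated stand-in for the poll data (an assumption, not the real polls)
set.seed(1)
x <- as.numeric(-150:-1)
y <- sin(x / 25) + rnorm(length(x), sd = 0.3)

# Weighted average at x0 using normal-density weights centered at x0
kernel_smooth <- function(x, y, x0, sd) {
  w <- dnorm(x, mean = x0, sd = sd)  # largest at x0, decaying with distance
  sum(w * y) / sum(w)
}

# A normal density has quartiles at +/- qnorm(0.75) * sd, so quartiles at
# +/- 0.25 * bandwidth correspond to this standard deviation
bandwidth <- 7
sd_kernel <- 0.25 * bandwidth / qnorm(0.75)

# Smoothed value at every observed day
by_hand <- sapply(x, function(x0) kernel_smooth(x, y, x0, sd_kernel))

# Estimate at a single day, for example day -55
kernel_smooth(x, y, x0 = -55, sd = sd_kernel)
```

Because the weights change gradually as the window moves, a point entering or leaving the neighborhood no longer causes a jump in the estimate.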
There are several functions in R that implement bin smoothers. One example is ksmooth shown above. However, in practice, we typically prefer methods that use slightly more complex models than fitting a constant. The final result above, for example, is still somewhat wiggly. Methods such as loess, which we explain next, improve on this.
Local weighted regression (loess) is similar to bin smoothing in principle. The main difference is that we approximate the local behavior with a line or a parabola. This permits us to expand the bin sizes, which stabilizes the estimates. Below we see lines fitted to two bins that are slightly larger than those we used for the bin smoother (code not shown). We can use larger bins because fitting lines provides slightly more flexibility than fitting constants.
As we did for the bin smoother, we show 12 steps of the process that leads to a loess fit (code not shown):
span <- 0.05
# broom's inflate() was deprecated in later releases; tidyr::crossing() followed
# by group_by() pairs every center with every data point in the same way
dat2 <- tidyr::crossing(center = unique(dat$X), dat) %>%
  group_by(center) %>%
  mutate(dist = abs(X - center)) %>%
  filter(rank(dist) / n() <= span) %>%                # keep the closest points
  mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)   # tricube weights
dat2 %>% filter(center %in% c(-125, -55)) %>%
  ggplot(aes(X, Y)) +
  geom_point(aes(alpha = weight)) +
  geom_smooth(aes(group = center, frame = center, weight = weight),
              method = "lm", se = FALSE) +
  geom_vline(aes(xintercept = center, frame = center), lty = 2) +
  geom_point(shape = 1, data = dat)
Note that now that we are fitting lines instead of constants, we can fit them to larger windows.
And then we fit a line locally at each point and keep the predicted value at that point:
(Animation: loess.gif)
There are three other important differences between loess and the typical bin smoother. The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument, which expects a proportion. For example, if N is the number of data points and span=0.5, then for a given \(x\), loess will use the 0.5*N closest points to \(x\) for the fit. The second difference is that, when fitting the parametric model to obtain \(f(x)\), loess uses weighted least squares, with higher weights for points that are closer to \(x\). The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and down-weighted for the next iteration. To use this option, we use the argument family="symmetric".
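The first two differences can be sketched for a single point \(x\) as follows. This is an illustration under stated assumptions, not the internals of `loess`: the helper `local_linear` is hypothetical, the data are simulated, the number of local points is taken as `floor(span * N)`, and the tricube weights are the same ones used in the plotting code earlier in this section.

```r
# Local linear fit at a single point x0 (a sketch, not loess's exact algorithm)
local_linear <- function(x, y, x0, span) {
  n_local <- floor(span * length(x))       # number of nearest points to use
  dist <- abs(x - x0)
  keep <- order(dist)[seq_len(n_local)]    # indices of the closest points
  h <- max(dist[keep])                     # distance to the farthest kept point
  w <- (1 - (dist[keep] / h)^3)^3          # tricube weights: 1 at x0, 0 at h
  local_dat <- data.frame(x = x[keep], y = y[keep], w = w)
  fit <- lm(y ~ x, data = local_dat, weights = w)  # weighted least squares
  unname(predict(fit, newdata = data.frame(x = x0)))
}

# Simulated stand-in for the poll data (an assumption, not the real polls)
set.seed(1)
x <- as.numeric(-150:-1)
y <- sin(x / 25) + rnorm(length(x), sd = 0.3)

# Fit a line locally at each point and keep the predicted value at that point
fitted_by_hand <- sapply(x, function(x0) local_linear(x, y, x0, span = 0.5))
```

Note that the window always contains the same number of points, so near the edges of the data it extends further in one direction instead of shrinking.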
The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters:
mod <- loess(Y ~ X, degree = 1, span = span, data = dat)
loess_fit <- augment(mod)
ggplot(dat, aes(X, Y)) +
  geom_point(cex = 5) +
  geom_line(aes(x = X, y = .fitted), data = loess_fit, color = "red")
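To get a feel for how the span and family arguments described above change the fit, we can try them on simulated data. The sketch below does not use the polls; the data frame `sim`, the injected outlier, and the roughness measure (sum of squared second differences) are all assumptions made for illustration.

```r
# Simulated data with one artificial outlier (not the real polls)
set.seed(1)
x <- as.numeric(-150:-1)
y <- sin(x / 25) + rnorm(length(x), sd = 0.3)
y[75] <- y[75] + 5
sim <- data.frame(X = x, Y = y)

# Same local-linear model with three different spans
spans <- c(0.1, 0.25, 0.5)
fits <- sapply(spans, function(s) {
  predict(loess(Y ~ X, degree = 1, span = s, data = sim))
})
colnames(fits) <- paste0("span_", spans)

# Roughness of each fitted curve: sum of squared second differences
roughness <- apply(fits, 2, function(f) sum(diff(f, differences = 2)^2))
roughness

# Default least-squares fit versus the robust fit
fit_ls  <- loess(Y ~ X, degree = 1, span = 0.25, data = sim)
fit_rob <- loess(Y ~ X, degree = 1, span = 0.25, data = sim,
                 family = "symmetric")

# Fitted values at the day of the outlier under each family
c(gaussian  = unname(predict(fit_ls)[75]),
  symmetric = unname(predict(fit_rob)[75]))
```

Comparing the two fitted values at the outlier's day gives a sense of how much a single extreme point pulls the default fit and how much of that pull the robust option removes.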
Note that different spans give us different smooths:
(Animation: loesses.gif)